Comparing study features is easy but identifying next steps is hard: Evaluating critical thinking through the Biology Lab Inventory of Critical Thinking in Ecology

Abstract Critical thinking, which can be defined as the evidence‐based ways in which people decide what to trust and what to do, is an important competency included in many undergraduate science, technology, engineering, and mathematics (STEM) courses. To help instructors effectively measure critical thinking, we developed the Biology Lab Inventory of Critical Thinking in Ecology (Eco‐BLIC), a freely available, closed‐response assessment of undergraduate students' critical thinking in ecology. The Eco‐BLIC includes ecology‐based experimental scenarios followed by questions that measure how students decide on what to trust and what to do next. Here, we present the development of the Eco‐BLIC using tests of validity and reliability. Using student responses to questions and think‐aloud interviews, we demonstrate the effectiveness of the Eco‐BLIC at measuring students' critical thinking skills. We find that while students generally think like experts while evaluating what to trust, students' responses are less expert‐like when deciding on what to do next.

small sample sizes, or data types prone to systematic issues that can make drawing conclusions difficult (Karban et al., 2014;Kjelvik & Schultheis, 2019). Because current issues in ecology can often impact public discourse (e.g., climate change and biodiversity), it is important that students learn how to evaluate the trustworthiness of data (McCright, 2011). Students learn about what to do by controlling confounding variables, making inferences, and distinguishing between correlation and causation (Bonner et al., 2017;Kjelvik & Schultheis, 2019;Mourad et al., 2012). Students also learn that experimental manipulations can sometimes be impossible due to ethical or logistical constraints (Karban et al., 2014).

| Assessment framework
One way to help instructors measure critical thinking in ecology courses is to provide evidence-based assessment instruments that focus on critical thinking. To date, most studies that assessed critical thinking in ecology field courses used student self-reports (e.g., writing reflections and self-assessment of their learning gains; McLaughlin et al., 2018) or qualitative evidence of critical thinking gains (Gillie & Bizub, 2012). Instructors could use several STEM (i.e., science, technology, engineering, and mathematics) assessments to measure critical thinking in their courses (Table 1). These instruments include content ranging from broad STEM to biology-specific topics. Many of the assessments, however, are open-response and may be challenging to score with large classes. In addition, a key design choice in critical thinking assessment is the inclusion of questions that explicitly probe students' evaluations of what to trust and what to do, which aligns with the definition of critical thinking from Walsh et al. (2019). Several assessments probe student understanding of data and methods ("What to trust"), but few also ask students to evaluate proposed next steps in a scientific investigation ("What to do"; Table 1).
Research also suggests that critical thinking is context- and domain-specific (Pithers & Soden, 2000; Willingham, 2008). Thus, critical thinking assessments should be embedded in a domain or disciplinary context, such as ecology. To disentangle the assessment of critical thinking skills from the assessment of students' knowledge about the context, one strategy is for the disciplinary context of the assessment to be accessible, such that all content knowledge needed to effectively complete the questions is present and at an appropriate content level for participants (Schwartz et al., 2016). While some of the available STEM critical thinking instruments summarized in Table 1 are biology-specific, none to date has an ecology-specific context.
Other design considerations include the structure and availability of the questions themselves (Table 1). In particular, students can better critique experimental scenarios when asked to explicitly compare and contrast, as opposed to evaluating each in turn (Heim et al., 2022). Open-response and closed-response formats also likely elicit different forms or levels of critical thinking (Pate, 2012). For example, while open-response questions may elicit more creativity in the exploration of concepts or topics, closed-response questions allow for more focused comparisons between groups or ideas (Pate, 2012;Quinn et al., 2018). Additionally, closed-response instruments better meet the need for large-scale evaluation of ecology courses because they can be scored and analyzed more quickly than open-response instruments. Finally, freely available instruments are more accessible to instructors than ones that require payment.

| Purpose and research aims
The Biology Lab Inventory of Critical Thinking in Ecology (Eco-BLIC) assesses students' critical thinking skills related to experimentation in ecology. Our goal was to create an assessment based on the design principles in Table 1: closed-response compare and contrast questions, discipline-specific, and freely available. This instrument would complement existing critical thinking assessments (Table 1) and provide a novel way to assess undergraduate student critical thinking in courses that include ecology. Using frameworks from Vision and Change (AAAS, 2011) and the Advancing Competencies in Experimentation-Biology (ACE Bio) Network (Pelaez et al., 2018), and building from a related instrument in physics, the Physics Lab Inventory of Critical Thinking (PLIC; Quinn et al., 2018;Walsh et al., 2018;Walsh et al., 2019), we created experimental scenarios and questions intended to probe students' critical thinking skills.
The scenarios and questions were designed to assess a range of students from multiple institutions and were research-validated following standard procedures, including comparing open- and closed-response versions, interviewing students, administering the assessments in multiple institutional contexts, and getting feedback from experts (Adams & Wieman, 2011). In this article, we answer the following research questions:
1. What is the statistical evidence of validity and reliability for the Eco-BLIC?
2. How do student and expert responses align when evaluating the two components of critical thinking (i.e., what to trust and what to do)?

2 | METHODS

| Question development & format
We developed Eco-BLIC questions through an iterative and stepwise process aligned with the standards of instrument design (De Vellis, 2003); these stepwise processes are further described below.
Others have used similar methods to design biology concept assessments (Bass et al., 2016; Couch et al., 2015; NRC, 2012; Smith et al., 2008; Table 2). The Eco-BLIC development is similar to the approach used for the Physics Lab Inventory of Critical Thinking (PLIC; Walsh et al., 2019). The PLIC is a 10-question, closed-response assessment that presents the experimental methods and findings from two hypothetical physics research groups, one of which uses a simpler approach and the other a more complex approach (Walsh et al., 2019). Both groups test a relationship involving the period of oscillation of a mass hanging from a spring. The questions ask respondents to evaluate the data and methods and propose next steps for each group. The PLIC underwent development, validity, and reliability testing similar to that presented here for the Eco-BLIC (see Walsh et al., 2019 for details).
The Eco-BLIC is administered via Qualtrics and provides students with experimental scenarios in which they learn about how different researchers approach answering the same question about feeding behaviors in a specific predator-prey relationship (Appendix A). Predator-prey relationships are commonly encountered in high school and introductory biology and ecology courses (Ginovart, 2014;Wasson, 2021) and often employ easy-to-analyze organism count data, thus making the content in the Eco-BLIC broadly accessible.
Students engage with two scenarios. One scenario is based on relationships between smallmouth bass (Micropterus dolomieu) and comb-mouthed minnow mayflies (Ameletus cryptostimulus), while the second is based on great-horned owls (Bubo virginianus) and house mice (Mus musculus). In the bass-mayfly scenario, students explore whether smallmouth bass selectively feed on larger or smaller mayflies. In the owl-mouse scenario, students explore how the presence of a great-horned owl influences the amount of time that mice spend feeding. As the Eco-BLIC is intended to measure critical thinking, students are not required to have extensive content knowledge beyond the information that is provided in the scenario prompts.
Although the scenarios are presented across multiple pages, students may go back to earlier pages in order to limit cognitive load.
Within each scenario, there are two research groups: one conducts their study in a laboratory-based setting, while the other conducts their study in a field-based setting. The descriptive prompts for each research group include a figure showing data, from which students are expected to form hypotheses and draw conclusions.

TABLE 1 Design principles of existing critical thinking and experimental design assessments that could be used in biology courses.

There is a multiple-choice prompt comprehension question asking students to interpret a figure (Table 3) and an open-response question asking students to explain their reasoning for their initial hypothesis. Students are later asked to compare the experiments in these two distinct settings, lab versus field scenarios. There is not one perfect and one problematic research group, as each group's study has both strong and weak features.
The two primary types of scored questions included in the Eco-BLIC are research group comparison items and next steps items (Table 3).
At the end of the instrument, students are asked to complete a short demographic survey, including questions about race/ethnicity, gender, major, and prior research experience (Appendix A).

| Participants and institutions
We administered versions of the Eco-BLIC to students from a diverse range of institutions (Table 4). We recruited participants mainly through professional organization listservs and focused emails to potentially interested instructors; we only required that participating courses focus on "ecology concepts and topics." Approximately 30% of students were first years, 18% were sophomores, 24% were juniors, and 28% were seniors. Nearly 66% had declared a major in biology or another life science. Over 60% of participating students identified as women and most students identified as White (56%), Hispanic or Latinx (19%), or Asian (17%). Most participants were recruited from general ecology (56%) and general biology (13%) courses, though the remaining 30% of participating courses covered broad topics (i.e., introductory courses in evolution and integrative biology and chemistry, and advanced courses in field biology, ecology, aquatic biology, botany, and ornithology). The majority of participating students were from introductory courses (91%), while 9% were from advanced courses. Participating courses had enrollment sizes ranging from approximately 10-350 students, with an average enrollment of approximately 100 (enrollment changes throughout the duration of courses limited our ability to report on exact enrollment counts).

| Open-response version (Fall 2019 and Spring 2020)
The open-response version of the Eco-BLIC included open-response questions to gather student thinking in their own words. Similar to the instrument development process used for other undergraduate assessments, questions were iteratively revised for clarity, length, and scientific accuracy based on written responses from students (Adams & Wieman, 2011;Bass et al., 2016;Couch et al., 2015;NRC, 2012;Smith et al., 2008;Walsh et al., 2019).
We also conducted student think-aloud interviews to achieve cognitive validation because they are an effective way to provide "evidence that survey items are interpreted by participants in the same way the researcher intended before the instrument is administered to a large sample" (p. 2, Trenor et al., 2011). We recruited 12 introductory and advanced undergraduates in Spring 2020 for semistructured think-aloud interviews, which were video- and audio-recorded via Zoom (Marbach-Ad et al., 2009; Smith et al., 2008). Students were asked to think aloud and explain their reasoning as well as any points of confusion, and the results were used to inform improvements to the language, structure, and clarity of the instrument (Anders & Simon, 1980; Marbach-Ad et al., 2009; Smith et al., 2008). We generated the closed-response version of the Eco-BLIC in the same manner as the PLIC (see Walsh et al., 2019 for more detail regarding this process). For example, in developing the closed-response Eco-BLIC, we adopted question formats similar to those of the PLIC, including multiple- and single-response items (Table 3). We also incorporated students' wording from open-response questions in creating the closed-response questions, rather than introducing expert jargon or terminology, for ease of comprehension.

TABLE 2 Overview of Eco-BLIC development.
1. Identify common themes encountered in introductory undergraduate biology and ecology courses (e.g., predator-prey relationships) and conduct literature review to ensure content intended to be included in scenarios is scientifically accurate.

lab-based research groups before the field-based research groups in each scenario. We found no significant difference (t-tests and ANOVA) in how students responded to questions when the ordering of scenarios was changed. Based on the results, we maintained the original question ordering (i.e., field-based followed by lab-based for the bass-mayfly scenario and lab-based followed by field-based for the owl-mouse scenario) in subsequent versions. We also iteratively used the feedback we received on these draft versions of the Eco-BLIC to clarify instructions and wording, add in missing elements, or remove questions and/or responses that were deemed unnecessary.
In a later draft version administered in Spring 2021, we also explored how students evaluated the quality of data in lab and field studies if individual evaluation questions were provided (i.e., questions that ask students to evaluate the strengths and weaknesses of different study features for each research group in a scenario individually). Ultimately, we found that students did not answer questions on the Eco-BLIC differently when the individual evaluation questions were present, and thus, we removed these questions in subsequent versions of the assessment (Heim et al., 2022).
Next, we conducted semistructured think-aloud interviews to explore question clarity on the revised assessment (Table 2). We used the same methods to achieve cognitive validation of the closed-response version as we did for the open-response version. Student participants spanned a range of biology concentrations (e.g., ecology and evolution, neurobiology, and physiology).
In Spring 2021, 20 experts provided feedback on the draft closed-response version of the Eco-BLIC. The experts were recruited through professional organization listservs and included biology and ecology professors, instructors/lecturers, and postdoctoral associates, from a wide array of institutions (e.g., 4-year institutions and community colleges). Experts were asked to both respond to the questions in the Eco-BLIC and offer written feedback on each page of the assessment (e.g., to note if wording was unclear, content was scientifically inaccurate, or the assessment duration was too long).
We used this feedback to develop the final version of the Eco-BLIC.

| Administration
We administered the final version of the Eco-BLIC to undergraduates across a range of institution types to confirm the utility of our instrument (Table 4). We sent participating instructors a survey link (Qualtrics, Provo, UT) to share with their students through course announcements, emails, and/or learning management systems and recommended that instructors provide credit for completing the Eco-BLIC to incentivize students. In general, students took between 20 and 30 min to complete the Eco-BLIC, which is administered online. Instructors assign the pre-Eco-BLIC to their students in the first 2 weeks of a course and the post-Eco-BLIC in the last 2 weeks of a course, either as an in-class or out-of-class assignment. Students are not informed of their pre- and post-test scores. Prior to conducting any statistical analyses, we excluded responses from students who did not consent to have their responses used for research, were not 18 years of age or older, did not include their name (needed for pre-post test matching), and/or completed the assessment in less than 5 min.
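The exclusion criteria above can be expressed as a simple filter. The following is a minimal sketch, not the authors' analysis code; the `Response` record and all of its field names are hypothetical stand-ins for whatever columns a survey export actually provides.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # All field names are hypothetical; they mirror the four exclusion
    # criteria listed in the text, not any real Qualtrics export schema.
    consented: bool
    age: int
    name: str
    minutes_to_complete: float


def is_eligible(r: Response) -> bool:
    """Return True if a response passes all four exclusion criteria."""
    return (
        r.consented                       # consented to research use
        and r.age >= 18                   # 18 years of age or older
        and r.name.strip() != ""          # name needed for pre-post matching
        and r.minutes_to_complete >= 5    # not completed in under 5 min
    )


# Made-up responses, one per exclusion reason plus one that passes.
responses = [
    Response(True, 19, "A. Student", 24.0),
    Response(False, 20, "B. Student", 22.0),  # no consent
    Response(True, 17, "C. Student", 25.0),   # under 18
    Response(True, 21, "", 30.0),             # no name
    Response(True, 22, "E. Student", 3.5),    # finished too quickly
]
eligible = [r for r in responses if is_eligible(r)]
print(len(eligible), "of", len(responses), "responses retained")
```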

| Scoring scheme
The Eco-BLIC scoring scheme was based on responses from 39 expert biologists (Table 2). The experts were recruited mainly through professional organization listservs and included biology and ecology professors, instructors/lecturers, and postdoctoral associates, from a wide array of institutions (e.g., 4-year institutions and community colleges). Experts were asked to respond to the Eco-BLIC questions and also given the opportunity to offer written feedback on each page of the assessment (e.g., to note if wording was unclear). The suggestions from the experts were minimal, and we only made small wording adjustments based on their suggestions.
We adapted the Eco-BLIC scoring scheme from the scoring scheme developed for the PLIC (Walsh et al., 2019). Because expert responses suggested that there was no single correct answer for scored questions on the Eco-BLIC, an all-or-nothing scoring scheme (in which students would receive full credit for choosing a single correct response or no credit for choosing alternate responses) for scored questions would be inaccurate. Instead, the fraction of experts selecting each response served as an estimate of the relative correctness of each response choice.

Prompt comprehension questions
The multiple-choice prompt comprehension questions (Table 3) and an open-response question asking students to explain their reasoning for their initial hypothesis are not scored. These questions are to help orient students to the experimental scenario.

Research group comparison items
Research group comparison items (Table 3), an indicator of "what to trust" in our assessment, are included in the scoring scheme. All research group comparison items on the Eco-BLIC have a multiple-choice format, in which students can choose a single option from four possible responses. We assign a value to each response option based on the fraction of experts who chose that option out of the total number of experts who responded to that item. For example, Figure 1 shows the calculation when comparing the lab and field studies in the owl-mouse scenario for the "Represented the predator appropriately" item. We then add the scores for each item within the scenario to get the owl-mouse research group comparison score. We apply the same scoring scheme to the bass-mayfly research group comparison questions.
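As an illustration of the expert-fraction scoring described above, the sketch below computes option values for one item and scores a student's choice. The option labels and expert counts are invented for illustration (only the total of 39 experts comes from the text), and the function names are hypothetical.

```python
from collections import Counter


def option_values(expert_choices):
    """Map each response option to the fraction of experts who chose it."""
    counts = Counter(expert_choices)
    total = len(expert_choices)
    return {option: n / total for option, n in counts.items()}


def comparison_score(student_answers, values_by_item):
    """Sum, over items, the expert fraction attached to the student's choice.

    An option that no expert selected is worth 0 points.
    """
    return sum(
        values_by_item[item].get(choice, 0.0)
        for item, choice in student_answers.items()
    )


# Hypothetical expert data for a single item: 39 experts, four options.
experts = ["field"] * 26 + ["lab"] * 6 + ["same"] * 5 + ["neither"] * 2
values_by_item = {"represented_predator": option_values(experts)}

score = comparison_score({"represented_predator": "field"}, values_by_item)
print(f"item score: {score:.2f}")
```

With several items in a scenario, `student_answers` and `values_by_item` would simply hold one entry per item, and the same sum gives the scenario-level score.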

Next steps items
Next steps items (the scored indicator of "what to do" in our assessment) are included in the scoring scheme as well. All next steps items on the Eco-BLIC have a multiple-response format, in which students can choose up to three options from a list of responses (Table 3). We assign values to each item based on the fraction of experts who chose that item out of the total number of responses we received for that single question (rounded to the nearest tenth). The maximum score attainable when a student selects $i$ responses, $V_{\max i}$, is the sum of the $i$ highest item values for that question:
$$V_{\max 1} = v_1, \qquad V_{\max 2} = v_1 + v_2, \qquad V_{\max 3} = v_1 + v_2 + v_3,$$
where $v_1 \geq v_2 \geq v_3$ denote the three highest values. Thus, if a student selects one, two, or three responses, they will score the maximum number of points if they select the one, two, or three highest valued responses, respectively.
For example, when evaluating next steps for the field study in the owl-mouse scenario, 73% of experts reported that conducting statistical analyses was most important, followed by sampling mice from other fields (41%) and repeating the study to gather more data (30%). These percentages would translate to scores of 0.7, 0.4, and 0.3, respectively. In this example, using the $V_{\max}$ equations outlined above, the maximum scores for choosing one, two, and three items are, respectively:
$$V_{\max 1} = 0.7, \qquad V_{\max 2} = 0.7 + 0.4 = 1.1, \qquad V_{\max 3} = 0.7 + 0.4 + 0.3 = 1.4.$$
A student who chooses only one option would need to select conducting statistical analyses (the top expert response) to receive a maximum score on this next steps question, while a student who chooses only two next steps items would need to select conducting statistical analyses and sampling mice from other fields (the top two expert responses) to receive a maximum score on this question. In a case where a student selects one of the top expert responses (e.g., conducting statistical analyses) and one nonexpert response outside the three highest-valued responses (e.g., account for human error, chosen by 0% of experts), the student would receive points for the top expert response and no points for the nonexpert response.
Therefore, students are not disproportionately penalized for selecting more or fewer responses on next steps questions using this scoring scheme. To normalize the next steps scores, we divide each student's score by the total maximum expert score for the same number of selected responses (i.e., $V_{\max}$).
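The normalization by V max can be sketched in a few lines. This is one reading of the scheme described above, not the authors' implementation: the function names are hypothetical, and the item values are the rounded owl-mouse field-study values quoted in the text (0.7, 0.4, 0.3, and 0 for "account for human error").

```python
def v_max(values, n_selected):
    """Maximum attainable raw score: sum of the n highest item values."""
    return sum(sorted(values.values(), reverse=True)[:n_selected])


def next_steps_score(selected, values, max_choices=3):
    """Normalized score for one next steps question.

    selected: options the student picked (one to three)
    values:   option -> expert-derived value, rounded to the nearest tenth
    """
    chosen = set(selected)
    if not 1 <= len(chosen) <= max_choices:
        raise ValueError("students select between one and three options")
    raw = sum(values.get(option, 0.0) for option in chosen)
    # Divide by the best possible score for the same number of selections.
    return raw / v_max(values, len(chosen))


# Item values taken from the worked example in the text.
values = {
    "conduct statistical analyses": 0.7,
    "sample mice from other fields": 0.4,
    "repeat the study to gather more data": 0.3,
    "account for human error": 0.0,
}

one_top = next_steps_score(["conduct statistical analyses"], values)
mixed = next_steps_score(
    ["conduct statistical analyses", "account for human error"], values
)
print(f"top response only: {one_top:.2f}; top plus nonexpert: {mixed:.2f}")
```

Under this reading, a student who picks the top response alone earns full credit, while pairing it with a zero-valued response yields 0.7 out of a possible 1.1 for two selections.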

Total Eco-BLIC score
The student's total score on the Eco-BLIC is obtained by summing the research group comparison scores (two subscores, one for the research group comparison items in the bass-mayfly scenario and one for the research group comparison items in the owl-mouse scenario) and the next steps question scores (four subscores, one for each lab and field group in the bass-mayfly and owl-mouse scenarios). Because there are only two subscores for research group comparison items compared with four for next steps questions, we multiplied the weight of each research group comparison subscore by two. Therefore, the maximum attainable score on the Eco-BLIC is eight points (four points from the research group comparison questions and four points from the next steps questions). While our scoring scheme is based on fractions of points, below we report scores as percentages for comparison purposes.
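The weighting of the six subscores can be illustrated as follows. This sketch assumes each subscore has already been scaled to lie between 0 and 1, which is what the stated eight-point maximum implies; the function names and the example subscores are hypothetical.

```python
def total_score(comparison_subscores, next_steps_subscores):
    """Combine subscores into a total out of 8 points.

    Assumes every subscore lies in [0, 1]. The two research group
    comparison subscores are double-weighted so that each question type
    contributes up to four points.
    """
    if len(comparison_subscores) != 2 or len(next_steps_subscores) != 4:
        raise ValueError("expected 2 comparison and 4 next steps subscores")
    return 2 * sum(comparison_subscores) + sum(next_steps_subscores)


def as_percentage(score, maximum=8.0):
    """Express a total score as a percentage of the maximum."""
    return 100 * score / maximum


# Made-up subscores for one student.
comparison = [0.75, 0.5]             # bass-mayfly, owl-mouse
next_steps = [0.5, 0.5, 1.0, 0.25]   # lab and field group in each scenario

total = total_score(comparison, next_steps)
print(f"{total} of 8 points ({as_percentage(total)}%)")
```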
FIGURE 1 Sample scoring scheme for research group comparison questions. Data shown are for the item "Represented the predator appropriately."

All statistical comparisons discussed below (i.e., when a p-value is reported) are based on either t-tests (for comparisons of two groups) or ANOVAs (for comparisons of more than two groups).
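For readers unfamiliar with two-group comparisons, the sketch below computes a t statistic and its degrees of freedom using only the Python standard library. The text does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the scores are made up. Obtaining a p-value would additionally require the t distribution (available in statistical packages), which is omitted.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    va = variance(a) / len(a)  # squared standard error of group a's mean
    vb = variance(b) / len(b)  # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up scores for two groups of five students.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```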

| FINDINGS
Below, we describe findings from different analyses investigating the reliability of the Eco-BLIC, including test and question difficulty, question discrimination, internal consistency and question-test correlations, test-retest reliability, and concurrent validity. Interquartile range is abbreviated as IQR. Note that we report scores as percentages for comparison purposes.

| Test and question difficulty
The average total score on the final version of the Eco-BLIC (n = 1103 student responses) was 65% for both the pretest and post-test (Figure 2a). There was no significant difference in total scores among student responses (p > .05).
We also examined average total scores across question types.
The average total score across the research group comparison questions was 77% on the pretest and 76% on the post-test (Figure 2b).
The average total score across next steps questions was 54% on the pretest and 53% on the post-test (Figure 2c). Similar to the total score data, there was no significant difference in research group comparison or next steps scores among student responses on the pre- and post-tests (p > .05), both for pooled responses and for matched data. While the research group comparison scores were negatively skewed (i.e., the distribution leans to the right), the total and next steps scores followed a more normal distribution. Thus, we report median and IQR along with means in our results and figure legends, to allow for more meaningful comparisons.
The average score per question ranged from 50% to 78% on the pretest and from 50% to 77% on the post-test (Figure 3), within the acceptable range noted in prior instrument validation studies (Ding & Beichner, 2009; Doran, 1980). We did not find any significant differences in item difficulty for any question between pre- and post-test responses (p > .05). The mean scores for the research group comparison questions were higher (75%-78%) than the mean scores for the next steps questions (50%-58%; Figure 3). While we did not find significant differences between pre- and post-test total, research group comparison, or next steps scores with the aggregate data (n = 1103 responses), we measured significant changes in pre- and postscores in three individual courses (two general ecology courses and one advanced ecology course).

| Question discrimination
We used question-test correlations (i.e., correlations between students' total scores on the Eco-BLIC and their scores on individual questions) to explore how well each question discriminated between low- and high-performing students.

| Internal consistency
and confirmatory factor analysis (CFA) using student responses from Spring 2022 (i.e., the second set of responses on the final version of the Eco-BLIC). EFA is used to "explore the possible underlying factor structure of a set of observed variables without imposing a preconceived structure on the outcome" while CFA is used to "test the hypothesis that a relationship between observed variables and their underlying latent constructs exist" (p. 1, Suhr, 2006). We chose to conduct factor analysis in this way to first establish and define the constructs of our instrument without preconceived expectations and then to confirm that the patterns we found in our Fall 2021 dataset were sound.
We conducted the EFA with oblique rotation (i.e., rotating the axes during factor analysis at an angle other than 90 degrees to improve the interpretation of factor loadings; Suhr, 2006), as this adjustment does not assume independence of student responses. We found that questions loaded primarily onto two factors that cumulatively explained nearly 35% of the variance of students' scores on the six primary Eco-BLIC questions. After analyzing factor loadings, we de-

| Test-retest reliability
Test-retest reliability of an instrument is usually measured by having the same respondents complete the assessment multiple times under the same conditions. As longitudinal administration of the

| Concurrent validity
We analyzed two forms of concurrent validity, defined as "a measure of the consistency of performance with expected results" (p. 10, Walsh et al., 2019), for our instrument. First, we compared question scores on the pretest between students in introductory courses (n = 582), students in advanced courses (n = 55), and experts (n = 39), with the expectation that experts would have higher scores than students.
To further parse out these patterns, we explored differences in research group comparison scores and next steps scores between these three groups. When comparing the research group comparison scores between introductory students, advanced students, and experts, there were no significant differences between groups (p > .05; Figure 4). We also found that while introductory and advanced students' next steps scores did not differ (p = .75), experts' next steps scores were significantly higher than those of introductory students (p < .001) and advanced students (p < .001; Figure 4).
The second form of concurrent validity analyzed how scores differ based on students' prior research experience. We expected that students with more research experience would be more likely to have higher scores on the Eco-BLIC because of their familiarity with the scientific process. We compared question scores on the pretests between students with no research experience (n = 454) and some research experience (i.e., one or more terms; n = 179). In the question, we defined a term as a semester, quarter, or summer session, and the survey indicated that research experiences should have been supervised by a faculty mentor.
We did not find any differences in research group comparison scores across students with varying research experience (Figure 5a).
However, we found that next steps scores were significantly higher for students with some research experience than for students with no research experience (p = .0006; Figure 5b).

FIGURE 5 Scores separated by the amount of student research experience for (a) research group comparison questions and (b) next steps questions. Some signifies one or more terms of research experience. Horizontal lines represent the median and lower and upper quartiles, while dots indicate outliers. Next steps question scores for students with no research experience (mean = 53%, median = 51%, IQR = 22%, n = 454) compared with some research experience (mean = 57%, median = 57%, IQR = 21%, n = 179). **Indicates significance (p < .001).

| Students' Eco-BLIC scores did not change over time
The overall lack of change from pre- to postscores across a range of biology and ecology courses in our study emphasizes a possible misalignment between student outcomes and the learning activities and assessments in the classroom. Further, this observation suggests that, although ranked as one of the most important and necessary outcomes of undergraduate degree programs (Gencer & Dogan, 2020; Murawski, 2014), critical thinking about experiments may not be commonly developed in the biology and ecology classroom (Fox & Hackerman, 2003; Handelsman, 2004). Bissell and Lemons (2006) attribute the challenges of incorporating pedagogical techniques that aim to improve critical thinking skills to (1) a lack of one common definition of critical thinking and (2) a limited number of instruments available to measure and assess critical thinking in the classroom. We encourage instructors to leverage the Eco-BLIC as a tool for measuring and assessing their students' critical thinking to better align instruction with critical thinking learning outcomes.
However, we also note that we did measure significant increases in pre- to postscores on the Eco-BLIC in three participating courses (with critical thinking gains in several other courses approaching significance), which suggests that the Eco-BLIC can measure changes in students' critical thinking in individual courses with unique instructors.
An important next step is to describe the learning activities, assessments, and learning objectives for each course and instructor to better understand alignment of these classroom aspects with critical thinking.
Additionally, integrating similar critical thinking instructional activities across participating courses and giving the Eco-BLIC pre and post could shed light on what instructional components influence critical thinking gains in undergraduate biology.
As further evidence of the lack of development in critical thinking over time, we also found that introductory and advanced students did not differ in their Eco-BLIC scores (Figure 4). Though it may seem intuitive that students in advanced courses would have more critical thinking skills than students in introductory courses and thus exhibit more expert-like thinking when evaluating what to trust and what to do, this was not the case. Quitadamo and Kurtz (2007) also noted the disconnect between faculty expectations of senior undergraduates' critical thinking and their students' performance on critical thinking assessments (AACU, 2005). If students in introductory biology and ecology courses are not gaining critical thinking skills, and instructors of advanced courses are assuming that students already gained these skills earlier, the opportunity to actually gain these skills may never have occurred.

| Students demonstrate less expert-like thinking when deciding what to do
We consistently observed that while students think similarly to experts in evaluating what to trust (i.e., the research group comparison questions), students' responses were less expert-like when deciding what to do (i.e., next steps questions; Figure 4). Walsh et al. (2019) found a similar pattern in physics scenarios using the PLIC, suggesting this result is not unique to ecology. Notably, students with at least one term of research experience scored significantly higher on the next steps questions than students who reported having no research experience (Figure 5b). This result supports our hypothesis that students with more research experience would have higher scores on the Eco-BLIC because of their familiarity with the scientific process. If students have more authentic experience in making decisions about what to do next in their research (e.g., troubleshooting and proposing future directions for their project), it seems reasonable that they would be more likely to apply those skills on the Eco-BLIC and thus score higher on the next steps questions compared with students with no research experience.
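A group comparison of this kind can be sketched as a two-sided permutation test on the difference in mean next steps scores. The data, the function name `permutation_test`, and the choice of a permutation test are assumptions for illustration only; the sketch does not reproduce the statistical test or the data used in this study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with an add-one correction so it is never exactly 0).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1

    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical next steps scores (fraction of expert-like choices).
with_research = [0.70, 0.75, 0.80, 0.65, 0.72]
no_research = [0.55, 0.60, 0.58, 0.62, 0.50]

observed_diff, p_value = permutation_test(with_research, no_research)
print(f"difference in means = {observed_diff:.3f}, p = {p_value:.4f}")
```

A permutation test is used here because it makes no distributional assumption about the scores; a t-test or a regression with covariates would be reasonable alternatives depending on the data.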
The discrepancy in students' abilities to decide what to do next could potentially be ameliorated by engaging students in undergraduate research opportunities to enhance critical thinking skills (Juanda, 2022). Gaining critical thinking skills has frequently been reported as a primary benefit for students participating in undergraduate research experiences (Bhattacharyya et al., 2018; Helix et al., 2022; Seifan et al., 2022; Seymour et al., 2004). Now that the Eco-BLIC is available to assess students' critical thinking and students with research experience show more expert-like thinking in evaluating what to do next, we should explore how to bring these skills to all students in our courses.
While we found that faculty-mentored research experiences are helping students to gain necessary critical thinking skills (Figure 5), it is not practical for all undergraduates to partake in research experiences led by faculty (Wei & Woodin, 2011). One option is to bring inquiry and discovery lab-based experiences to the classroom through course-based undergraduate research experiences (CUREs), which have been found to encourage students' critical thinking, specifically their evaluation of what to do next in experimental scenarios. CUREs also provide opportunities for students to think like expert scientists (Brownell & Kloser, 2015) and may promote iteration and thinking about next steps in the face of research failures (Gin et al., 2018), which is important to consider given that ecological data can be messy and unpredictable. Future work should disentangle whether students who take courses with opportunities to design their own authentic experiments see comparable gains in scores on the next steps items as students engaging in undergraduate research.
In addition to lab experiences, learning experiences in the lec-

| Limitations
To make this instrument easy for instructors to implement and score, we used a closed-response format. However, this format can be limit-

| Conclusions
We collected validity and reliability evidence for the Eco-BLIC, which demonstrates that it can be used to measure critical thinking across a range of biology and ecology courses to better understand how students evaluate both what to trust and what to do. Through assessing concurrent validity, we found that students demonstrate less expert-like thinking when deciding what to do and that students in introductory and advanced courses do not differ in their critical thinking skills. Further, while students' Eco-BLIC scores did not change over time, students with some amount of research experience had more expert-like thinking on next steps questions compared with students who had no experience. The results indicate that instructors may wish to reflect on the alignment of their critical thinking-related course learning outcomes and activities, deliberately design or adapt course materials to provide opportunities for students to gain critical thinking skills (particularly those focused on evaluating what to do next), and measure their students' critical thinking using instruments like the Eco-BLIC. Currently, instructors of participating courses receive a summary report of their students' Eco-BLIC scores from the research team to interpret critical thinking changes across the duration of their courses. In the future, the Eco-BLIC will have an automated scoring system that will give instructors access to a more detailed breakdown of the scoring. Instructors

funding acquisition (equal); methodology (equal); supervision (equal); validation (equal); writing - review and editing (equal). Michelle K.

ACKNOWLEDGMENTS
We thank the members of the Cornell Discipline-based Education Research group for their feedback on this article, as well as Matt

This work was supported by a National Science Foundation grant (DUE-1909602) and a National Science Foundation Graduate Research Fellowship (DGE-2139899). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

CONFLICT OF INTEREST STATEMENT
The authors have declared that no competing interests exist.

DATA AVAILABILITY STATEMENT
The datasets referenced in this article are not readily available because the approved study protocol and consent form explicitly state that this sensitive human subject data will be confidentially protected and will not be shared publicly due to the personal

A.2 | GROUP 1'S STUDY
We conducted surveys of 30 nearby ponds: 15 ponds that contained smallmouth bass and 15 ponds that contained no smallmouth bass.
Below you will find a picture of one of Group 1's study sites.
We measured 10 young mayflies within each pond and calculated the mean length of mayflies for each. We found the following pattern:
Reminder: Each circle equals the mean (average) mayfly length per pond.
What do you think Group 1 should say about the feeding pattern between smallmouth bass and mayflies?
Smallmouth bass selectively consume smaller mayflies.
Smallmouth bass selectively consume larger mayflies.
Smallmouth bass consume mayflies with no size preference.
There is not enough evidence to determine the feeding pattern.
Please explain your reasoning in the space below.
________________________________________________________
How effective was Group 1, overall, in testing whether smallmouth bass selectively feed on larger or smaller mayflies?

A.3 | GROUP 2'S STUDY
We collected smallmouth bass and young mayflies from a single pond. We established ten tanks and placed 100 mayflies in each tank. We placed one smallmouth bass in each tank for 24 h. The tanks are covered with a net, so the mayflies cannot move to different tanks.
Below is a picture of Group 2's laboratory setup.
We took a random sample of 20 mayflies from each tank before and after the 24-hour period. We calculated the mean (average) length of mayflies in each sample. We found the following pattern:
How do you think Group 1 and Group 2 performed in the following categories?
Group 1 was more effective
What do you think Group 2 should say about the feeding pattern between smallmouth bass and mayflies?
Smallmouth bass selectively consume smaller mayflies.
Smallmouth bass selectively consume larger mayflies.
Smallmouth bass consume mayflies with no size preference.
There is not enough evidence to determine the feeding pattern.
Please explain your reasoning in the space below.

________________________________________________________
How effective was Group 2, overall, in testing whether smallmouth bass selectively feed on larger or smaller mayflies?
Ineffective (1) 2
In these two studies on the feeding patterns of smallmouth bass and mayflies, we told you that "two groups of biologists" were carrying out the research. Who did you picture when you were thinking of the

A.4 | GROUP 1'S STUDY
We trapped 10 mice from multiple nearby fields. We brought them into the laboratory and set up two cages each containing five mice and a rodent nest box where mice can hide, burrow, and sleep. A bowl with a large amount of seeds was placed outside the nest. We placed infrared cameras in the cages to record mouse behavior over one night and watched the video to determine the time the mice spent at the food bowl.
One mouse cage was placed in Room 1 and one mouse cage was placed in Room 2. In Room 1, mouse behavior was recorded as they moved in and out of the nest box. In Room 2, we played 30-s great-horned owl calls every 15 min and recorded mouse behavior as they moved in and out of the nest box.
Below are pictures of the laboratory setup in Rooms 1 and 2.
We calculated the mean (average) amount of time the five mice in each room spent at their food bowl and found the following pattern:

Note: Error bars indicate standard deviation
What do you think Group 1 should say about the feeding behavior of mice while great-horned owl calls play?
Mice spend less time at the food bowl in the presence of an owl predator call.
Mice spend more time at the food bowl in the presence of an owl predator call.
Mice spend the same amount of time as they usually do at the food bowl in the presence of an owl predator call.
There is not enough evidence to determine mouse feeding behavior.
Please explain your reasoning for your choice in the space below:
________________________________________________________
How effective was Group 1, overall, in testing the feeding behavior of mice while great-horned owl calls play?
Ineffective (1)   2   3   Effective (4)
What should Group 1 do next? (Select up to 3 options total.)
Redesign the study to run for a longer period of time
The two groups of biologists want to know how the presence of a great-horned owl influences the amount of time that mice spend feeding.

A.5 | GROUP 2'S STUDY
We set up 15 cages in an outdoor enclosure. Each cage has one mouse and a rodent nest box where mice can hide, burrow, and sleep. The mice were trapped from a single nearby field. A bowl with a large amount of seeds was placed outside the nest.
We conducted our study across two nights. We placed infrared cameras in the cages to record mouse behavior throughout these nights and watched the video to determine the time the mice spent at their food bowls.
On night one, mouse behavior was recorded in the absence of a predator.
On night two, we placed one great-horned owl in the outdoor enclosure to measure mouse behavior in the presence of a predator. The owl was able to freely fly around the enclosure and could rest in trees, but it could not access the caged mice. The mice were able to view, smell, and hear the owl.
Below is a picture of Group 2's fieldsite and an example of one of the cages.
We determined the total amount of time each mouse spent at their food bowl per night and found the following pattern:
Reminder: Each circle equals the total amount of time a mouse spent at the food bowl.
How do you think Group 1 and Group 2 performed in the following categories?
Group 1 was more effective
What do you think Group 2 should say about the feeding behavior of the mice in the presence of a great-horned owl?
Mice spend less time at the food bowl in the presence of an owl predator.
Mice spend more time at the food bowl in the presence of an owl predator.
Mice spend the same amount of time as they usually do at the food bowl in the presence of an owl predator.
There is not enough evidence to determine mouse feeding behavior.
Please explain your reasoning for your choice in the space below:
________________________________________________________
How effective was Group 2, overall, in testing the feeding behavior of mice in the presence of a great-horned owl?
Ineffective (1)   2   3   Effective (4)
What should Group 2 do next? (Select up to 3 options total.)
Redesign the study to run for a longer period of time
Provided an adequate graph/data representation
In these two studies on the feeding behavior of mice in the presence or absence of owls, we told you that "two groups of biologists" were carrying out the research. Who did you picture when you were thinking of the "biologists"?

Expert scientists
Other (Please describe in box):

A.6 | DEMOGRAPHIC QUESTIONS
Instructions: Please answer the following questions to the best of your ability. Your answers will be used to better understand the characteristics of students taking this survey.
Please indicate how well you agree with the following statements:
Strongly disagree
Somewhat disagree
Neither agree nor disagree
Somewhat agree
Strongly agree
When I was younger, I spent a lot of time in natural areas (e.g., exploring and hiking).
Now, I spend a lot of time in natural areas (e.g., exploring and hiking).
I have prior experience conducting field research (collecting data outdoors on natural phenomena).
I want more experience conducting field research (collecting data outdoors on natural phenomena).
Have you participated in undergraduate research as part of a faculty member's research group? If so, for how many terms (term = 1 semester or 1 quarter or 1 summer)?
Prefer not to disclose